Evaluating Various Tokenizers for Arabic Text Classification
Abstract
The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is not efficient in terms of vocabulary size. In the literature, many tokenization algorithms have emerged to tackle this problem by creating subwords, which in turn limits the vocabulary size for a given text corpus. Most tokenization techniques are language-agnostic, i.e., they do not incorporate the linguistic features of a given language, and evaluating such techniques in practice is difficult. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to other popular tokenizers using unsupervised evaluations. In addition, we compare the six tokenizers on supervised classification tasks: sentiment analysis, news classification, and poem-meter classification, using publicly available datasets. Our experiments show that none of the tokenization techniques is the best choice overall, and that the performance of a given tokenization algorithm depends on many factors, including the size of the dataset, the nature of the task, and the morphological richness of the dataset. However, some tokenization techniques are better overall compared to others on various text classification tasks.
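The contrast between word-level and subword vocabularies can be illustrated with a minimal sketch. It is not one of the tokenizers evaluated in the paper: it assumes a generic greedy pair-merging scheme in the style of byte-pair encoding, and the toy English corpus, the function names `word_vocab` and `learn_subword_vocab`, and the merge budget of 8 are all hypothetical choices made for illustration.

```python
from collections import Counter


def word_vocab(corpus):
    """Vocabulary when every whitespace-separated word is one token."""
    return {word for line in corpus for word in line.split()}


def learn_subword_vocab(corpus, num_merges):
    """Greedy pair merging (BPE-style sketch).

    Start from single characters and repeatedly merge the most frequent
    adjacent symbol pair, so the vocabulary holds at most
    (number of distinct characters + num_merges) entries.
    """
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    vocab = {symbol for word in words for symbol in word}
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        vocab.add(best[0] + best[1])
        merged_words = Counter()
        for word, freq in words.items():
            symbols, i = [], 0
            while i < len(word):
                if word[i:i + 2] == best:
                    symbols.append(word[i] + word[i + 1])
                    i += 2
                else:
                    symbols.append(word[i])
                    i += 1
            merged_words[tuple(symbols)] += freq
        words = merged_words
    return vocab


# Hypothetical toy corpus: six stems combined with five suffixes.
STEMS = ["low", "new", "wide", "slow", "fast", "kind"]
SUFFIXES = ["", "er", "est", "ly", "ness"]
corpus = [" ".join(stem + suffix for suffix in SUFFIXES) for stem in STEMS]

words = word_vocab(corpus)
subwords = learn_subword_vocab(corpus, num_merges=8)
print(len(words), len(subwords))
```

The word-level vocabulary grows with every new stem and suffix combination, while the subword vocabulary is capped by the number of distinct characters plus the merge budget, whatever the corpus size.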
Similar resources
Evaluating Text Clustering Methods for Text Classification
In this project report, I evaluate several text clustering approaches and how they can be used for text classification. The particular tasks are topic classification on the 20 Newsgroups dataset and sentiment classification on a restaurant reviews dataset. Future directions for improving the results are also discussed.
Text Summarization as Feature Selection for Arabic Text Classification
Text classification (TC), or text categorization, is the task of assigning a document to one or more predefined classes or categories. A common problem in TC is the high number of terms or features in the document(s) to be classified (the curse of dimensionality). This problem can be solved by selecting the most important terms. In this study, automatic text summarization is used for feature selection. S...
High capacity steganography tool for Arabic text using 'Kashida'
Steganography is the ability to hide secret information in cover media such as sound, pictures, and text. A new approach is proposed to hide a secret in Arabic text cover media using "Kashida", an Arabic extension character. The proposed approach is an attempt to maximize the use of "Kashida" to hide more information in Arabic text cover media. To approach this, some algorithms have been des...
A Comparative Study on Arabic Text Classification
This paper focuses on automatic Arabic text classification. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. There are many published experimental results on classifying Arabic text. Since these results came from different datasets, authors, and evaluation metrics, we cannot compare the performance of the classifiers tested. In this pape...
Arabic Text Classification Using Support Vector Machines
Text classification (TC) is the process of classifying documents into a predefined set of categories based on their content. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. In this paper, we applied the Support Vector Machines (SVM) model to classify Arabic text documents. The results are compared with the other traditional classifiers Baye...
Journal
Journal title: Neural Processing Letters
Year: 2022
ISSN: 1573-773X, 1370-4621
DOI: https://doi.org/10.1007/s11063-022-10990-8